The files have been split into one for Task 1 and one for Tasks 2, 3 and 4. This was done after we received an email telling us this split was required for delivery. The Full project is the unedited version.
If you prefer to use the GitHub hosted files, then use these links
We have now been hired as analysts at Insight Analytics (IA), and the company that “owns” our data wants to develop a system that can report sales for all their outlets in the group. To demonstrate what we can deliver, they have announced a tender competition, and our task is to participate in the competition by solving a series of tasks.
The first task we are to undertake is to write code in R that combines the four datasets they have given us into one large dataset.
Code
rm(list = ls()) # Cleaning our environment

# Load required libraries
library(tidyverse)
library(haven)
library(curl)
library(utils)
library(janitor)
library(glue)
library(leaflet)
library(knitr)
library(plotly)

# Code used to see which category we get to work with
category <- c("Analgesics", "Bath Soap", "Beer", "Bottled Juices", "Cereals", "Cheeses",
              "Cigarettes", "Cookies", "Crackers", "Canned Soup", "Dish Detergent",
              "Front-end-candies", "Frozen Dinners", "Frozen Entrees", "Frozen Juices",
              "Fabric Softeners", "Grooming Products", "Laundry Detergents", "Oatmeal",
              "Paper Towels", "Soft Drinks", "Shampoos", "Snack Crackers", "Soaps",
              "Toothbrushes", "Canned Tuna", "Toothpastes", "Bathroom Tissues")

letter2number <- function(x) {
  utf8ToInt(x) - utf8ToInt("A") + 1L
}

seed_number <- sum(letter2number("Daniel")) + sum(letter2number("Daniel"))
set.seed(seed_number)

# Using glue to display the result nicer in html
cat(glue("Our seed number is {seed_number} so our category is {sample(category, 1)}"))
Our seed number is 410 so our category is Shampoos
Code
# We free up RAM by removing old objects we don't need to keep loaded.
rm(category, letter2number, seed_number)
We start by loading the packages we will use for the tasks. We then run the code that uses our "seed" to determine which product category we have been assigned: Shampoos.
Code
# Storing the urls in a list
urls <- c(
  "https://www.chicagobooth.edu/boothsitecore/docs/dff/store-demos-customer-count/ccount_stata.zip",
  "https://www.chicagobooth.edu/boothsitecore/docs/dff/store-demos-customer-count/demo_stata.zip",
  "https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/movement_csv-files/wsha.zip"
)

# Storing the zip files in a list
zip_files <- c("ccount_stata.zip", "demo_stata.zip", "wsha.zip")

# Looping through the URLs to download and import the files
for (i in seq_along(urls)) {
  # Downloading the zip file
  curl_download(urls[i], zip_files[i])
  # Unzipping the file in the loop
  unzip(zip_files[i], exdir = "data")
  # Finding the .dta and .csv files
  stata_files <- c(list.files("data", pattern = ".*\\.dta", full.names = TRUE),
                   list.files("data", pattern = ".*\\.csv", full.names = TRUE))
  # Loop to create dfs
  for (stata_file in stata_files) {
    if (grepl("ccount.dta", stata_file)) {
      ccount_stata <- read_stata(stata_file)
    } else if (grepl("demo.dta", stata_file)) {
      demo_stata <- read_stata(stata_file)
    } else if (grepl("wsha.csv", stata_file)) {
      wsha <- read_csv(stata_file)
    }
  }
}

# Reading csv for df
upcsha <- read_csv("https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/upc_csv-files/upcsha.csv")

# Finally we can delete the zip files now that everything is imported into R.
file.remove(zip_files)
[1] TRUE TRUE TRUE
Code
unlink("data", recursive = TRUE)
# We free up RAM by removing old objects we don't need to keep loaded.
rm(urls, zip_files, stata_file, i)
This code downloads and imports data files from different websites and unpacks them into a folder. It then identifies the .dta and .csv files, and creates data frames from these files. Finally, it downloads a CSV file from another website and creates a final data frame from this file. All this is done using loops, so it does not have to be done manually each time. In the end, the code deletes the temporary files that were downloaded and unpacked.
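The dispatch-by-filename idea used in the loop above can be illustrated on its own. The sketch below skips the real downloads and instead writes two small files into a temporary folder (the folder and filenames here are made up for the example; the real code first runs curl_download() and unzip()):

```r
library(readr)

# Hypothetical files in a temp folder, mirroring the grepl() dispatch in the loop above
dir <- file.path(tempdir(), "data_demo")
dir.create(dir, showWarnings = FALSE)
write_csv(data.frame(UPC = 1:3, MOVE = c(5, 0, 2)), file.path(dir, "wsha.csv"))
write_csv(data.frame(store = 1:2), file.path(dir, "other.csv"))

# Loop over the files found and read only the one whose name matches
for (f in list.files(dir, pattern = "\\.csv$", full.names = TRUE)) {
  if (grepl("wsha.csv", f)) {
    wsha_demo <- read_csv(f, show_col_types = FALSE)
  }
}
nrow(wsha_demo) # 3
```

The same pattern scales to any number of files: list everything in the folder once, then branch on the filename to decide which object each file becomes.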
Code
# Defining the start date of week 1
week_1_start <- as.Date("1989-09-14") # Please see appendix for usage of AI

# CLEANING DATA BLOCK
ccount_stata <- remove_empty(ccount_stata, which = "cols") # Removing all columns with only NA

ccount_stata2 <- ccount_stata %>%
  mutate(date = ymd(date)) %>% # Forcing date class onto the date column, making NAs where there are invalid dates
  filter(!is.na(date)) %>% # Dropping NA rows
  filter(date >= as.Date("1992-02-20") & date <= as.Date("1999-12-31")) %>% # We only have data from 1992, so we filter for that
  mutate(week = as.integer((date - week_1_start) / 7) + 1) %>% # Creating week column. Please see appendix for usage of AI
  select(week, store, custcoun, haba) %>% # Grabbing only the data we want
  group_by(week, store) %>% # Grouping to aggregate to weekly
  summarise_all(sum) # Aggregating to weekly so we can merge the dataframes
This code cleans and processes the "ccount_stata" dataset. First, it removes all columns that contain only missing values. Then it converts the "date" column to date format and drops rows with missing values in this column. It also filters the data so that only the period from February 20, 1992 to December 31, 1999 is retained. It then creates a new column called "week" that counts the number of weeks since the start of the dataset (September 14, 1989). Next, it selects only the columns that are relevant for further analysis. Finally, the data is grouped by week and store, and the numerical values are summed for each group. The result is weekly aggregated sales data for each store over the period 1992 to 1999.
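The week formula can be sanity-checked in isolation. A small sketch (the dates are chosen for the example): any date 0-6 days after the start date falls in week 1, day 7 starts week 2, and so on.

```r
# Week 1 starts on 1989-09-14, the start of the Dominick's data
week_1_start <- as.Date("1989-09-14")
week_of <- function(date) as.integer((as.Date(date) - week_1_start) / 7) + 1L

week_of("1989-09-14") # week 1
week_of("1989-09-21") # week 2 (exactly 7 days later)
week_of("1993-12-30") # week 225, the start of our 1994 window
```

The integer division truncates, so every 7-day span maps to a single week number, which is what makes the later weekly merges line up.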
Code
# Copied the table from the PDF and pasted it into ChatGPT to have it write out all the
# variables, then put them into this list. Did the same with Description.
# Please see appendix for AI usage.
variable_names <- c("age9", "age60", "ethnic", "educ", "nocar", "income", "incsigma",
                    "hsizeavg", "hsize1", "hsize2", "hsize34", "hsize567", "hh3plus",
                    "hh4plus", "hhsingle", "hhlarge", "workwom", "sinhouse", "density",
                    "hval150", "hval200", "hvalmean", "single", "retired", "unemp",
                    "wrkch5", "wrkch17", "nwrkch5", "nwrkch17", "wrkch", "nwrkch",
                    "wrkwch", "wrkwnch", "telephn", "mortgage", "nwhite", "poverty",
                    "shopcons", "shophurr", "shopavid", "shopstr", "shopunft",
                    "shopbird", "shopindx", "shpindx")

navn_for_demo_stata <- tibble(
  variable_name = variable_names,
  description = c("% Population under age 9", "% Population over age 60",
                  "% Blacks & Hispanics", "% College Graduates", "% With No Vehicles",
                  "Log of Median Income", "Std dev of Income Distribution (Approximated)",
                  "Average Household Size", "% of households with 1 person",
                  "% of households with 2 persons", "% of households with 3 or 4 persons",
                  "% of households with 5 or more persons",
                  "% of households with 3 or more persons",
                  "% of households with 4 or more persons",
                  "% of households with 1 person", "% of households with 5 or more persons",
                  "% Working Women with full-time jobs", "% Detached Houses",
                  "Trading Area in Sq Miles per Capita",
                  "% of Households with Value over $150,000",
                  "% of Households with Value over $200,000",
                  "Mean Household Value (Approximated)", "% of Singles", "% of Retired",
                  "% of Unemployed", "% of working women with children under 5",
                  "% of working women with children 6 - 17",
                  "% of non-working women with children under 5",
                  "% of non-working women with children 6 - 17",
                  "% of working women with children", "% of non-working women with children",
                  "% of working women with children under 5",
                  "% of working women with no children", "% of households with telephones",
                  "% of households with mortgages", "% of population that is non-white",
                  "% of population with income under $15,000", "% of Constrained Shoppers",
                  "% of Hurried Shoppers", "% of Avid Shoppers", "% of Shopping Stranges",
                  "% of Unfettered Shoppers", "% of Shopper Birds",
                  "Ability to Shop (Car and Single Family House)",
                  "Ability to Shop (Car and Single Family House)") # Please see appendix for AI usage
)

# Creating a variable_list to filter out the variables we do not want
variable_list <- c("hvalmean", "incsigma", "income", "nocar", "sinhouse", "shopindx",
                   "shpindx", "hh3plus", "hh4plus", "hhsingle", "hhlarge", "density",
                   "wrkch5", "nwrkch5", "wrkwch", "telephn")

# We then select what to keep from variable_names, and drop what is in variable_list
demo_stata <- demo_stata %>%
  select(name, city, zip, lat, long, store,
         variable_names[variable_names %in% names(demo_stata)]) %>%
  select(-all_of(variable_list))

# We then remove to free up RAM
rm(variable_list, variable_names)
Then, the names of the variables and their descriptions are stored in two separate lists. After that, a new table is created with columns for variable names and descriptions.
Next, a list of variable names is created to filter out only the variables that are desired. Then, the data table is filtered by retaining columns that are included in the variable name list but excluding columns that are included in another list called “variable_list”.
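The keep-then-drop selection pattern is easiest to see on a tiny data frame. This is a hypothetical sketch (the columns and lists here are made up, standing in for variable_names and variable_list):

```r
library(dplyr)

# Mini data frame standing in for demo_stata
d <- tibble(store = 1, age9 = 0.1, educ = 0.3, income = 10, junk = NA)
keep <- c("age9", "educ", "income", "notpresent") # like variable_names
drop <- c("income")                               # like variable_list

d2 <- d %>%
  select(store, all_of(keep[keep %in% names(d)])) %>% # keep only columns that actually exist
  select(-all_of(drop))                               # then drop the unwanted ones
names(d2) # "store" "age9" "educ"
```

Filtering the keep-list against names(d) first means a listed variable that is missing from the data ("notpresent" here) is silently skipped instead of causing an error.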
Code
# Set the scipen option to 999 to avoid scientific notation when printing large numbers
options(scipen = 999)

# Dropping NA
demo_stata <- demo_stata %>% drop_na()

# We want to make figures, but some groups of percentages do not sum to 100, so we
# calculate the missing percentages in mutate() and store them in new variables
# (agebetween, % white, etc.).
demo_stata <- demo_stata %>%
  mutate("% of population that is between the age of 9 and 60" = 1 - age9 - age60) %>%
  mutate("% of population that is white" = 1 - nwhite) %>%
  mutate("% of population that is neither black & hispanics or white" = nwhite - ethnic) %>%
  mutate("% of people that arent college graduates" = 1 - educ) %>%
  mutate("% of women that arent working" = 1 - workwom) %>%
  mutate("% of households without mortages" = 1 - mortgage) %>%
  mutate("% of population with income over $15,000" = 1 - poverty)

# Pivot the demo_stata data frame to a longer format
df_long <- demo_stata %>%
  pivot_longer(cols = -c(name, city, zip, lat, long, store), # Specifying the columns not to pivot
               names_to = "variable_name", # New column for the variable names
               values_to = "value") # New column for the values

# Replacing the short variable names with their descriptions; it looks much better
# and gives more information
df_long <- df_long %>%
  left_join(navn_for_demo_stata, by = "variable_name") %>%
  mutate(variable_name = ifelse(is.na(description), variable_name, description)) %>%
  select(-description)

# We then remove the dataframe we don't need anymore
rm(navn_for_demo_stata)
Here, the scipen option is set to 999 to avoid scientific notation when large numbers are printed, and the drop_na function removes rows with NA values from the dataset. We then use the mutate function to calculate the complementary percentage values so that each group of percentages sums to 100.
Finally, in this block of code, we make the dataset longer, and arrange for the variable names to be replaced with the description of what they do so we get more information. In the end, we remove the description and a dataset we don’t need further.
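The pivot-and-relabel step can be shown on a toy table. A minimal sketch with hypothetical values, mirroring the pivot_longer() plus lookup-join used above:

```r
library(tidyverse)

# Hypothetical wide data and a small name-to-description lookup
wide <- tibble(store = 98, age9 = 0.12, educ = 0.30)
lookup <- tibble(variable_name = c("age9", "educ"),
                 description = c("% Population under age 9", "% College Graduates"))

long <- wide %>%
  pivot_longer(cols = -store, names_to = "variable_name", values_to = "value") %>%
  left_join(lookup, by = "variable_name") %>%
  # Keep the short name only when no description exists
  mutate(variable_name = ifelse(is.na(description), variable_name, description)) %>%
  select(-description)

long$variable_name # "% Population under age 9" "% College Graduates"
```

The ifelse() guard matters: variables without an entry in the lookup keep their original short name instead of becoming NA.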
Code
# Merge the movement file with the UPC file
df <- wsha %>%
  select(-SALE) %>% # Dropping SALE, which indicates whether it was a sale
  inner_join(upcsha, by = "UPC") %>% # Inner join to keep only matching UPCs
  clean_names() %>% # Using janitor to clean the names
  mutate(date = week_1_start + (week - 1) * 7) %>%
  select(date, everything()) %>% # Sorting to have date first
  filter(move > 0, ok == 1) %>% # Using only actual sales and only valid data
  mutate(sales = price * move / qty) %>% # Calculating total dollar sales
  select(-price_hex, -profit_hex, -ok) # Dropping hex values for price and profit

df <- df %>%
  inner_join(ccount_stata2, by = c("week", "store"))

final_df <- df %>%
  filter(date >= "1993-12-30" & date <= "1995-01-03") %>% # Filtering for 1994 since we lack data for 1990-1993
  select(-com_code, -upc, nitem)

# We free up RAM by removing old dataframes we don't need to keep loaded.
rm(demo_stata, df, ccount_stata, ccount_stata2, upcsha, wsha, week_1_start)
The code begins by dropping the column that indicates whether the observation was a sale, "SALE", and then merges the movement dataset with the UPC dataset using UPC as the key. The column names are tidied up with the janitor package, and a new column called "date" is created from "week_1_start" and the week number. The dataset is then filtered to keep only rows with actual sales and valid data, and a new dataset is created by filtering to the year 1994. Finally, memory is freed by removing the datasets that are no longer necessary.
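The dollar-sales formula deserves a quick worked check. In the movement data, price is the price of a bundle of qty units while move counts units sold, so dollar sales = price * move / qty. A toy sketch with hypothetical numbers:

```r
library(dplyr)

# Row 1: single items at $1.99, 4 sold -> $7.96
# Row 2: 3-for-$5.00 bundle, 6 units sold -> 2 bundles -> $10.00
mv <- tibble(price = c(1.99, 5.00), qty = c(1, 3), move = c(4, 6))
mv <- mv %>% mutate(sales = price * move / qty)
mv$sales # 7.96 10.00
```

Without dividing by qty, bundled items would be counted at the full bundle price per unit and overstate revenue.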
Code
# Creating a list of brands to use in the loop
brands <- c("vo5", "spirit", "bold_hold", "head_and_shoulders", "ivory")

# Here we had the help of ChatGPT to create the REGEX code.
# Patterns matching every spelling variant of the shampoo names. Please see appendix for AI usage.
patterns <- c("V\\s*[-|]?\\s*O?\\s*5",
              "(teen|TN)\\s*sp[i|]r[i|]t",
              "BOLD\\s*HOLD",
              "(H&S|HEAD\\s*&\\s*SHOULDERS|HEAD\\s*&\\s*SHLDRS?|HD\\s*&\\s*SHLDRS?|H\\s*&\\s*S)",
              "IVORY")

# Storing the results in this list
results <- list()

# Loop extracting the shampoo brands we want for the final dataframe
for (i in seq_along(brands)) {
  data <- final_df %>%
    # Filter for the pattern of the current brand
    filter(str_detect(descrip, regex(patterns[i], ignore_case = TRUE))) %>%
    # Group the filtered observations
    group_by(date, store, week) %>%
    # Summarise only the numeric columns
    summarise_if(is.numeric, sum, na.rm = TRUE) %>%
    # Tag the rows with the brand name
    mutate(brand = brands[i])
  # Store this in the results list
  results[[brands[i]]] <- data
}

# Removing to free up RAM
rm(brands, i, patterns, final_df)
The code creates a list of different shampoo brands, and a list of different patterns to find the names of these brands. Then, the code creates an empty list called results. A loop goes through each brand and uses the patterns to filter and group data in final_df for each brand, then adds the results to the results list.
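The regex-based matching can be demonstrated on a few made-up product descriptions (the strings below are hypothetical, not rows from the data). The VO5 pattern tolerates optional spaces and a dash between the letters, and ignore_case handles lowercase entries:

```r
library(stringr)

# Hypothetical description strings
descrip <- c("ALBERTO VO5 SHAMPOO", "ALB V-O5 CONDITIONER", "alberto vo5", "SUAVE SHAMPOO")
pattern <- "V\\s*[-|]?\\s*O?\\s*5"
hits <- str_detect(descrip, regex(pattern, ignore_case = TRUE))
hits # TRUE TRUE TRUE FALSE
```

This is why a plain fixed-string filter like descrip == "VO5" would miss rows: the raw data abbreviates brand names inconsistently, and one flexible pattern per brand catches the variants.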
Code
# Combining the rows from the brand datasets into one final dataframe
df <- results[["vo5"]] %>%
  bind_rows(results[["spirit"]]) %>%
  bind_rows(results[["bold_hold"]]) %>%
  bind_rows(results[["head_and_shoulders"]]) %>%
  bind_rows(results[["ivory"]])

# The demographics (df_long) finally join the complete and finished df
df <- df %>%
  left_join(df_long, by = "store") %>%
  rename(dflong_value = value) %>%
  select(-nitem, -case)

df <- df %>%
  pivot_wider(names_from = "variable_name", values_from = dflong_value) %>%
  select(-47) # Removing NA column

# Remove the last dataframes
rm(results, data, df_long)

# Writing the dataframe to a csv file. row.names = FALSE means the row names
# will not be written to the file.
write.csv(df, "sok-1005_data.csv", row.names = FALSE)
The code first takes the data frames for the various brands and binds them together into one final data frame (df). It then performs a left join with the demographics data frame (df_long) and renames its 'value' column to 'dflong_value' to avoid confusion. We widen the dataframe to reduce the space required by the file. The code then writes the 'df' data frame to a csv file called "sok-1005_data.csv" without row names, and the remaining data frames are removed from RAM to free up memory. The data manipulation is done: we have combined the four datasets into one large dataset, ready to be used for figures and further analysis.
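The widening step is the inverse of the earlier pivot_longer. A minimal sketch with hypothetical values shows how one row per store/variable pair becomes one column per variable:

```r
library(tidyr)

# Hypothetical long-format demographics for one store
long <- tibble::tibble(store = c(98, 98),
                       variable_name = c("% College Graduates", "% of Retired"),
                       dflong_value = c(0.30, 0.15))

wide <- pivot_wider(long, names_from = variable_name, values_from = dflong_value)
names(wide) # "store" "% College Graduates" "% of Retired"
```

Since the same demographic values no longer repeat on every brand/week row, the wide layout stores each value once per store, which is what shrinks the csv.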
Task 2
We have chosen these specific brands because some brands left the market and others entered it over the years, and 1994 was the most consistent year for the brands we chose.
We run code that groups all the stores and sums across all their customers, then keeps only the outlet with the most customers. This is Dominicks 98, which we will look at further and store in a separate dataset.
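The group-sum-keep-the-max step can be sketched on toy numbers (the stores and counts below are hypothetical), mirroring how we single out the busiest outlet:

```r
library(dplyr)

# Hypothetical weekly customer counts for three stores
toy <- tibble(store = c(1, 1, 98, 98, 12), custcoun = c(100, 120, 300, 310, 50))

busiest <- toy %>%
  group_by(store) %>%
  summarise(total_custcoun = sum(custcoun)) %>% # one total per store
  filter(total_custcoun == max(total_custcoun)) # keep only the largest
busiest$store # 98
```

Joining this one-row result back onto the full data (as done below with left_join) then recovers all the weekly rows for just that store.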
Code
tryCatch({
  df <- read_csv("sok-1005_data.csv")
}, error = function(err) {
  df <<- read_csv("https://raw.githubusercontent.com/Danieljoha/Sok-1005-Data_science_project/main/sok-1005_data.csv")
})

df <- df %>%
  pivot_longer(cols = -c(date, store, week, move, qty, price, profit, sales,
                         custcoun, haba, brand, name, city, zip, lat, long), # Specifying the columns not to pivot
               names_to = "variable_name", # New column for the variable names
               values_to = "dflong_value") # New column for the values

# Recreating demo_stata
demo_stata <- df %>%
  ungroup() %>%
  select(-date, -brand) %>%
  group_by(store, name, city, zip, lat, long, variable_name, dflong_value) %>%
  summarise(across(1:8, \(x) sum(x, na.rm = FALSE)), .groups = "keep") %>%
  ungroup() %>%
  select(-week)

df <- df %>%
  filter(variable_name %in% c("% Population under age 9", NA)) %>%
  select(-name, -city, -zip, -lat, -long, -variable_name, -dflong_value)
Code
# Filtering for the most relevant store to plot
dominicks_98 <- df %>%
  group_by(store) %>%
  summarise(total_custcoun = sum(custcoun)) %>%
  filter(total_custcoun == max(total_custcoun)) %>%
  left_join(df, by = "store") %>%
  select(-total_custcoun)

# Pivoting longer so we can use facet_wrap to show more plots
dominicks_98_long <- dominicks_98 %>%
  pivot_longer(cols = c(sales, haba, custcoun, profit),
               names_to = "names", values_to = "values")
Code
# Starting the plot
dominicks_98_long %>%
  ggplot(aes(x = date, y = values, fill = names)) +
  geom_col() + # Using geom_col for a bar chart
  theme_minimal() + # Theme
  labs(x = "", y = "Values", fill = "names", title = "Dominicks 98") + # Changing some names
  scale_fill_manual(values = c("custcoun" = "#276DB6", # Changing colors manually
                               "haba" = "#B627A5",
                               "profit" = "cornflowerblue",
                               "sales" = "purple")) +
  # Facet for 4 different figures in the same plot
  facet_wrap(~ names, scales = "free",
             labeller = labeller(names = c("custcoun" = "Customer count",
                                           "haba" = "Total cosmetic sales",
                                           "profit" = "Profit",
                                           "sales" = "Sales"))) +
  # Adjusting the theme for positions of title and legend
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.4, size = 20),
        strip.text = element_text(size = 17)) +
  # Fixing the x-axis to show months
  scale_x_date(date_breaks = "1 month", date_labels = "%b")
Here you see weekly customer counts for Dominicks 98, which has the most customers of all the Dominicks stores in Chicago. Dominicks 98 was closed in October 1994, so we don't have data for that month, but it was still the best performing store. Further, we can see total cosmetic sales, which is the weekly turnover for shampoo in this store. Gross profit is the difference between the purchase price and the selling price. Total cosmetic sales and gross profit are in dollars. Sales is a variable we created following the recipe in the manual; according to the manual, it shows total dollar sales, including items sold in bundles.
Code
# We ended up not using this figure, if you wondered why there is a code-fold here
#| column: page
brand_colors <- c("Bold Hold" = "#1f77b4",
                  "Head and Shoulders" = "#ff7f0e",
                  "Ivory" = "#2ca02c",
                  "Spirit" = "#d62728",
                  "Vo5" = "#9467bd")

old_fig <- dominicks_98 %>%
  mutate(haba = haba / 1000,
         brand = case_when(
           brand == "bold_hold" ~ "Bold Hold",
           brand == "head_and_shoulders" ~ "Head and Shoulders",
           brand == "ivory" ~ "Ivory",
           brand == "spirit" ~ "Spirit",
           brand == "vo5" ~ "Vo5"
         )) %>%
  plot_ly() %>%
  add_trace(x = ~date, y = ~haba, type = 'bar', split = ~brand, color = ~brand,
            colors = brand_colors, text = ~brand,
            hovertemplate = "Brand: %{text}<br>Value: %{y}<extra></extra>") %>%
  layout(title = list(text = "Dominicks 98", xref = "paper", x = 0.4),
         xaxis = list(title = "", tickformat = "%b", dtick = "M1"),
         yaxis = list(title = "Values in dollars (1000)"),
         legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"),
         barmode = 'stack',
         annotations = list(
           list(text = "Weekly sales for shampoo", xref = "paper", yref = "paper",
                x = 0.4, y = 0.99, showarrow = F, font = list(size = 12))
         )) %>%
  config(displayModeBar = FALSE)

#old_fig
Code
# Defining colors for each brand and recoding the brand names
dominicks_98_colors <- dominicks_98 %>%
  mutate(brand = case_when(
    brand == "bold_hold" ~ "Bold Hold",
    brand == "head_and_shoulders" ~ "Head and Shoulders",
    brand == "ivory" ~ "Ivory",
    brand == "spirit" ~ "Spirit",
    brand == "vo5" ~ "Vo5")
  ) %>%
  mutate(color = brand_colors[brand]) %>%
  mutate(week = as.integer(format(date, "%W")))

dominicks_98_plot <- dominicks_98_colors %>%
  split(.$brand) %>%
  map(~ggplot(data = .x,
              aes(x = week, y = haba / 1000, fill = brand,
                  text = paste0("Week: ", week, "<br>",
                                "Brand: ", brand, "<br>",
                                "Health and beauty sales in thousands: <br>$", haba / 1000))) +
        geom_col() +
        scale_fill_manual(values = unique(.x$color)) +
        theme_minimal() +
        labs(x = "Week", y = "Dollar sales for each brand (1000)") +
        scale_x_continuous(expand = c(0, 0), breaks = seq(1, 52, by = 1),
                           labels = seq(1, 52, by = 1)) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
              plot.title = element_text(hjust = 0.4)) +
        ggtitle(paste(unique(.x$brand))))

# Convert ggplot objects to Plotly plots
dominicks_98_plotly <- dominicks_98_plot %>%
  map(~ggplotly(.x, tooltip = "text"))

# Print each modified plot
dominicks_98_plotly[[1]]
dominicks_98_plotly[[2]]
dominicks_98_plotly[[3]]
dominicks_98_plotly[[4]]
dominicks_98_plotly[[5]]
Code
brand_list <- c("Bold Hold", "Head and Shoulders", "Ivory", "Spirit", "Vo5")
brand_map <- c("bold_hold" = "Bold Hold",
               "head_and_shoulders" = "Head and Shoulders",
               "ivory" = "Ivory",
               "spirit" = "Spirit",
               "vo5" = "Vo5")

for (brand in brand_list) {
  original_brand_name <- names(brand_map)[brand_map == brand] # Get the original brand name
  brand_data <- dominicks_98 %>%
    mutate(haba = haba / 1000,
           brand = case_when(brand == original_brand_name ~ brand)) %>%
    filter(brand == brand) # Rows for other brands become NA above and are dropped here
  plot_ly(brand_data) %>%
    add_trace(x = ~date, y = ~haba, type = 'bar', color = ~brand,
              colors = brand_colors[brand], text = ~brand,
              hovertemplate = paste("Brand: ", brand, "<br>Value: %{y}<extra></extra>")) %>%
    layout(title = list(text = paste("Dominicks 98 - ", brand), xref = "paper", x = 0.4),
           xaxis = list(title = "", tickformat = "%b", dtick = "M1"),
           yaxis = list(title = "Values in dollars (1000)"),
           legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"),
           annotations = list(
             list(text = paste("Weekly sales for", brand), xref = "paper", yref = "paper",
                  x = 0.4, y = 0.99, showarrow = F, font = list(size = 12))
           )) %>%
    config(displayModeBar = FALSE)
}
In figure 2, we look at which of the selected shampoo brands with a lot of movement sells best. You can see that Vo5 clearly has the largest sales volume measured in dollars, and that Head and Shoulders was not as big then as it is today.
Code
# List of variables of interest
variables2 <- list(
  c('% of population that is white',
    '% of population that is neither black & hispanics or white',
    '% Blacks & Hispanics'),
  c("% of people that arent college graduates", "% College Graduates"),
  c("% of population with income under $15,000", "% of population with income over $15,000"),
  c("% of households without mortages", "% of households with mortgages")
)

# List to store the resulting datasets
datasets <- list()

# Loop through the variables and create the corresponding datasets
for (i in seq_along(variables2)) {
  datasets[[i]] <- demo_stata %>%
    filter(store == "98") %>%
    unique() %>%
    filter(variable_name %in% variables2[[i]])
}
In figure 3, the group management can see which customer group shops the most at Dominicks 98: those who contribute the most to sales in this particular store are 81% white, 1.2% other, and 17.8% black or Latin American. Further, in figures 4, 5 and 6, we look at whether the customers are college graduates, whether they are above the poverty line, and whether they have mortgages.
Code
variable_names <- list(
  c('% of population that is white',
    '% of population that is neither black & hispanics or white',
    '% Blacks & Hispanics'),
  c("% of people that arent college graduates", "% College Graduates"),
  c("% of population with income under $15,000", "% of population with income over $15,000"),
  c("% of households without mortages", "% of households with mortgages")
)

create_pie_chart <- function(dataset, variable_names) {
  pie_data <- dataset %>%
    filter(variable_name %in% variable_names) %>%
    mutate(display_value = round(dflong_value * 100, 2),
           custom_text = paste0(variable_name, ": ", display_value, "%"))
  colors <- c('blueviolet', '#B627A5', 'cornflowerblue')
  pie_chart <- plot_ly(pie_data, labels = ~variable_name, values = ~display_value,
                       type = 'pie', textinfo = 'percent',
                       marker = list(colors = colors), text = ~custom_text,
                       hovertemplate = "%{label}: <br>%{value}%<extra></extra>") %>%
    layout(title = "",
           legend = list(orientation = "h", x = 0.5, y = -0.3, xanchor = "center")) %>%
    config(displayModeBar = FALSE)
  return(pie_chart)
}

plots <- pmap(list(datasets, variable_names), create_pie_chart)
Code
plots[[1]]
plots[[2]]
plots[[3]]
plots[[4]]
Task 3
It is important to note that because we are working with weekly data, we have redefined week 225 to start at the beginning of 1994. Since we now use monthly data, this shouldn't make a large difference, but the data includes the final 4 days of 1993 and excludes the last 4 days of 1994. This means January contains a few days of December 1993, February a few days of January, and so on.
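The weekly-to-monthly rollup can be sketched as follows. This is a hypothetical example (the actual construction of df_monthly is not shown in this document, and the column names here are assumptions): extract the month from each week's date, then sum within months.

```r
library(dplyr)

# Six hypothetical weekly observations starting in week 225 (1994-01-06)
weekly <- tibble(date = as.Date("1994-01-06") + 7 * (0:5),
                 sales = c(10, 20, 30, 40, 50, 60))

monthly <- weekly %>%
  mutate(month = as.integer(format(date, "%m"))) %>% # month number of each week's date
  group_by(month) %>%
  summarise(sales = sum(sales))
monthly$sales # 100 110 (Jan: 10+20+30+40, Feb: 50+60)
```

Because each week is assigned wholly to the month its date falls in, a week straddling a month boundary ends up in one month only, which is exactly the few-days slippage described above.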
First, we will present monthly values for all Dominick’s stores from the entire chain. We can see these in the figure below.
Code
df_monthly %>%
  pivot_longer(cols = c(sales, haba, custcoun, profit),
               names_to = "names", values_to = "values") %>%
  ggplot(aes(x = month, y = values, fill = names)) +
  geom_col() + # Using geom_col for a bar chart
  theme_minimal() + # Theme
  labs(x = "", y = "Values", fill = "names", title = "All Dominicks stores. Monthly values") +
  scale_fill_manual(values = c("custcoun" = "#276DB6", # Changing colors manually
                               "haba" = "#B627A5",
                               "profit" = "cornflowerblue",
                               "sales" = "purple")) +
  # Facet for 4 different figures in the same plot
  facet_wrap(~ names, scales = "free",
             labeller = labeller(names = c("custcoun" = "Customer count",
                                           "haba" = "Total cosmetic sales",
                                           "profit" = "Profit",
                                           "sales" = "Sales"))) +
  # Adjusting the theme for positions of title and legend
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.4, size = 20),
        strip.text = element_text(size = 15)) +
  scale_x_continuous(expand = c(0, 0), n.breaks = 12)
Next, we look at which brands are doing well. As in the previous task, we can see that Vo5 is selling the best across all of Chicago. NB: The y-axis has different scales on the different plots.
Code
# Define colors for each brand
brand_colors <- c("Bold Hold" = "#1f77b4",
                  "Head and Shoulders" = "#ff7f0e",
                  "Ivory" = "#2ca02c",
                  "Spirit" = "#d62728",
                  "Vo5" = "#9467bd")

df_monthly_with_colors <- df_monthly %>%
  mutate(brand = case_when(
    brand == "bold_hold" ~ "Bold Hold",
    brand == "head_and_shoulders" ~ "Head and Shoulders",
    brand == "ivory" ~ "Ivory",
    brand == "spirit" ~ "Spirit",
    brand == "vo5" ~ "Vo5")) %>%
  # Add a color column based on brand
  mutate(color = brand_colors[brand])

plots_monthly <- df_monthly_with_colors %>%
  split(.$brand) %>%
  map(~ggplot(data = .x, aes(x = month, y = haba / 1000, fill = color)) +
        geom_col() +
        scale_fill_identity() +
        theme_minimal() +
        scale_x_continuous(expand = c(0, 0), n.breaks = 12) +
        labs(x = "", y = "Dollar sales for each brand (1000)") +
        theme(plot.title = element_text(hjust = 0.4)) +
        ggtitle(paste(unique(.x$brand))))

# Print each plot
invisible(map(plots_monthly, print))
The data we have can be used to find new and suitable retail outlets. For example, we can look at the best outlets we already have over time in aggregate, then run a regression on, say, the 5 or 10 best locations to see whether customer flow has trended upwards. If we find a location that is popular and has a growing customer base, we can look for nearby locations where a new store could serve the increasing customer flow.
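The idea of a per-store customer trend boils down to the slope of a simple linear regression of customers on date. A toy sketch with hypothetical numbers (the real code below loops this over every store):

```r
# One fictional store whose customers grow by 25 per week over ten weeks
dates <- as.Date("1994-01-06") + 7 * (0:9)
custcoun <- 1000 + 25 * (0:9)

fit <- lm(custcoun ~ dates)   # lm treats Date predictors as days
slope <- coef(fit)[["dates"]] # customers gained per day; positive = rising trend
slope # 25/7, i.e. about 3.57 customers per day
```

Ranking stores by this slope (largest first) is what identifies the outlets with the fastest-growing customer flow.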
Code
# Making an empty list
model_list <- list()

# Looping to add each store's model to the list
for (store_id in unique(df$store)) {
  # Subset the data for the current store
  store_data <- subset(df, store == store_id)
  # Fitting the model
  model <- lm(custcoun ~ date, data = store_data)
  # Adding the model and its coefficients to the list
  model_list[[as.character(store_id)]] <- list(model = model, coef = coef(model))
}

# Getting the coefficients for all stores
coef_df <- data.frame(do.call(rbind, lapply(model_list, function(x) x$coef)))
This shows the top 10 stores with an increasing customer trend.
Code
# Plotting the most promising store
df %>%
  filter(store == 8) %>%
  ggplot(aes(x = date, y = custcoun)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(x = "", y = "Customers", title = "Dominicks 8 customers regression") +
  theme_minimal() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")
The results from the regression analysis show that store 8 has the fastest-growing customer flow. Looking at the map, this location already has good coverage of Dominicks stores, so we should examine what this store does better than the others we already have in the area; if a new store is to be set up, it should be further east. Otherwise, looking at the map, the area that could support more stores is West Chicago, for example around Winfield.
Dominicks 74 is the store with the most declining customers.
Code
table <- coef_df %>%
  arrange(-date) %>%
  rownames_to_column(var = "store") %>%
  mutate(store = as.numeric(store))

palette <- colorNumeric(palette = c("red", "green"), domain = loc_data$sales)

loc_data %>%
  mutate(lat = lat / 10000, long = long / 10000 * -1) %>%
  leaflet() %>%
  addTiles(options = tileOptions(minZoom = 9, maxZoom = 13)) %>%
  addCircleMarkers(
    clusterOptions = markerClusterOptions(maxClusterRadius = 20),
    lat = ~lat, lng = ~long, # Locations for markers
    radius = 12, # Circle size
    color = ~palette(sales), # Color of circle based on sales
    fillOpacity = 0.5,
    stroke = FALSE, # No outline
    label = ~as.character(store), # Store labels
    labelOptions = labelOptions(noHide = TRUE, direction = "center", textOnly = TRUE,
                                fontSize = 16, fontWeight = "bold"),
    popup = paste0("<strong>Store:</strong> ", loc_data$store,
                   "<br><strong>Name:</strong> ", loc_data$name,
                   "<br><strong>City:</strong> ", loc_data$city,
                   "<br><strong>Zip:</strong> ", loc_data$zip,
                   "<br><strong>Profit from shampoo sales:</strong> $", loc_data$profit,
                   "<br><strong>Total shampoo sales:</strong> $", loc_data$sales,
                   "<br><strong>Customer count</strong> (thousands): ", round(loc_data$custcoun / 1000),
                   "<br><strong>Health and beauty sales</strong> (thousands): ", round(loc_data$haba / 1000))
  ) %>%
  addLegend(pal = palette, values = ~sales, title = "Shampoo total<br> dollar sales ($)")
On this interactive map, we can see that it is primarily the central Chicago stores that are doing best and those on the outskirts that are doing worst. Our conclusion is that the areas in the middle of Chicago with the least coverage are where it will be most profitable to set up one or more new stores. We also ran code, written with help from ChatGPT, to check whether there is some form of correlation. Note, however, that statistical significance matters when interpreting correlations: a correlation may be non-zero, but if it is not statistically significant it may not be meaningful. We have not used any formal statistical tests, as we lack the knowledge to do this in a meaningful way, but we felt the correlations were still relevant to look at.
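For reference, base R can compute both the correlation and a significance test in one step. A toy sketch with simulated (hypothetical) data, not our store data:

```r
# Simulated example: y depends strongly on x, so the correlation should be
# large and clearly significant
set.seed(1)
x <- rnorm(30)
y <- 2 * x + 0.5 * rnorm(30)

r <- cor(x, y, use = "pairwise.complete.obs") # same call as in the code below
ct <- cor.test(x, y)                          # also returns a p-value

r           # strong positive correlation
ct$p.value  # very small, i.e. significant at conventional levels
```

cor.test() would be a natural next step for our store correlations: the estimate matches cor(), and the p-value gives a rough guard against reading meaning into noise.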
Code
correlations <- demo_stata %>%
  pivot_wider(names_from = variable_name, values_from = dflong_value) # This code is with the use of ChatGPT

# List of demographic columns
demographic_cols <- colnames(correlations)[5:6]

# Initialize an empty vector to store correlation results
correlation_results <- c()

# Calculate correlation for each demographic variable. This code is with the use of ChatGPT
for (col in demographic_cols) {
  correlation <- cor(correlations$sales, correlations[[col]], use = "pairwise.complete.obs")
  correlation_results <- c(correlation_results, correlation)
}

# Combine demographic column names and correlation results into a dataframe
correlation_df <- data.frame(Demographic = demographic_cols, Correlation = correlation_results)

# Print the correlation dataframe
correlation_df
| Demographic | Correlation |
|-------------|-------------|
| lat         | 0.1445141   |
| long        | -0.1655860  |
This suggests it might be a good idea to have stores at a higher latitude, which is further north, and a lower longitude, which is further east, so placing a store in the north-east Chicago area might be something to look into.
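The significance caveat we raised can be checked with base R's `cor.test()`, which returns both the correlation estimate and a p-value. Below is a minimal sketch on made-up data; the vectors `x` and `y` are illustrative stand-ins, not our actual latitude and sales columns.

```r
# Hedged example: cor.test() on illustrative data, not our store dataset
set.seed(410)
x <- rnorm(60)            # stand-in for e.g. latitude
y <- 0.3 * x + rnorm(60)  # stand-in for sales, weakly related to x

res <- cor.test(x, y, method = "pearson")
res$estimate  # sample correlation
res$p.value   # p-value for H0: true correlation is zero

# A small correlation with a large p-value should not be over-interpreted
if (res$p.value < 0.05) {
  message("Correlation is statistically significant at the 5% level")
}
```

Applied to our data, this would let us flag which of the correlations in the tables below are distinguishable from zero before drawing conclusions from them.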
Code
correlations <- demo_stata %>%
  pivot_wider(names_from = variable_name,
              values_from = dflong_value) # This code is with the use of ChatGPT

# List of demographic columns
demographic_cols <- colnames(correlations)[14:ncol(correlations)]

# List of variables for which we want to calculate correlations
variables <- c("sales", "profit", "custcount", "haba")

# Initialize an empty dataframe to store correlation results
correlation_df <- data.frame() # This code is with the use of ChatGPT

# Calculate correlation for each demographic variable
for (var in variables) {
  correlation_results <- c()
  for (col in demographic_cols) {
    # Ensure both columns are numeric before calculating correlation
    if (is.numeric(correlations[[var]]) & is.numeric(correlations[[col]])) {
      correlation <- cor(correlations[[var]], correlations[[col]],
                         use = "pairwise.complete.obs")
      correlation_results <- c(correlation_results, correlation)
    } else {
      correlation_results <- c(correlation_results, NA)
    }
  }
  temp_df <- data.frame(Demographic = demographic_cols,
                        Correlation = correlation_results,
                        Variable = var)
  correlation_df <- rbind(correlation_df, temp_df)
}

# Keep only rows with correlation greater than 0.1 or less than -0.1
# This code is with the use of ChatGPT
correlation_df <- subset(correlation_df, abs(Correlation) > 0.1)

# Print the correlation dataframe
correlation_df
|     | Demographic | Correlation | Variable |
|-----|-------------|-------------|----------|
| 5   | % Working Women with full-time jobs | -0.1284505 | sales |
| 8   | % of Retired | 0.1091721 | sales |
| 10  | % of Unemployed | 0.1189193 | sales |
| 15  | % of households with mortgages | -0.2543562 | sales |
| 16  | % of households without mortages | 0.2543562 | sales |
| 21  | % of population that is neither black & hispanics or white | 0.1850833 | sales |
| 26  | % of women that arent working | 0.1284505 | sales |
| 27  | % of working women with children | -0.2062318 | sales |
| 28  | % of working women with children 6 - 17 | -0.2004089 | sales |
| 31  | % Blacks & Hispanics | -0.2206697 | profit |
| 32  | % College Graduates | -0.1124497 | profit |
| 45  | % of households with mortgages | -0.1341649 | profit |
| 46  | % of households without mortages | 0.1341649 | profit |
| 47  | % of non-working women with children | 0.1022449 | profit |
| 49  | % of people that arent college graduates | 0.1124497 | profit |
| 52  | % of population that is non-white | -0.2049068 | profit |
| 53  | % of population that is white | 0.2049068 | profit |
| 93  | % Population over age 60 | -0.1037599 | haba |
| 94  | % Population under age 9 | 0.1508498 | haba |
| 100 | % of Unemployed | 0.2141608 | haba |
| 102 | % of households with 2 persons | -0.1555487 | haba |
The variable “haba” means health and beauty sales.
What we can see here is that there may be some signs of correlation with the demographic data, but we cannot be sure. Further analysis is needed.
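One possible direction for that further analysis is to collect each correlation together with its p-value in a single pass, and then filter on both size and significance. The sketch below assumes a toy data frame `df` with made-up columns `sales`, `demo1` and `demo2`; it is an illustration of the approach, not our actual `demo_stata` data.

```r
# Sketch: correlations plus p-values in one pass, on made-up data
set.seed(410)
df <- data.frame(sales = rnorm(50),
                 demo1 = rnorm(50),
                 demo2 = rnorm(50))

demo_cols <- c("demo1", "demo2")

# Build one row per demographic column: estimate and p-value from cor.test()
corr_table <- do.call(rbind, lapply(demo_cols, function(col) {
  res <- cor.test(df$sales, df[[col]])
  data.frame(Demographic = col,
             Correlation = unname(res$estimate),
             P_value     = res$p.value)
}))

# Keep only correlations that are both sizeable and significant
subset(corr_table, abs(Correlation) > 0.1 & P_value < 0.05)
```

Compared with the loop above, this keeps the estimate and its significance side by side, so the `abs(Correlation) > 0.1` cutoff is no longer the only filter.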
References
The data used for this assignment was gathered from the following links
We have also used the resources from the course plan in SOK-1005 and the lectures given by Dejene Gizaw Kidane to help us create the code to wrangle the data, so there may be similarities in places in Task 1. However, any and all errors are ours and ours alone.
Some of the code used to create the figures and do the data wrangling has been taken from our previously delivered assignments. The candidate numbers are added at the top of the document, and we have delivered the assignments in Canvas with links to our GitHub pages. However, for the purpose of author anonymity, note that our GitHub pages contain our names, so they should only be checked after grading is final.
Appendix for AI usage
Task 1
At the beginning of Task 1 we used ChatGPT to help define the start of "week 1", as we were unsure how to convert the weekly data to dates; it gave us the idea we used to create the code. Code lines 144 and 152.
We used ChatGPT to enter the variable names after pasting the variable names from the PDF. Code lines 163-173.
In code line 257 we had help from ChatGPT writing the regex that picks out the brands, as the brand names were irregular, with inconsistent use of letters and such.
Finally, we used ChatGPT to translate the documentation we had written for Task 1 into English by prompting it with "translate to english" and adding our text. This was done to help with the grading of the task.
Task 2
ChatGPT helped us write a tryCatch block so that if the file isn't found locally, it is fetched from GitHub, where it is hosted.
ChatGPT has also been used to help create a better structure in the text we wrote. This was done by prompting the model with our text and asking it to improve it, with some translations.
Task 3
Minimal usage.
Again, ChatGPT was used to help structure the text better and at times translate our Norwegian comments.
Task 4
The use of ChatGPT is shown at this link, and we translated one section of Task 3 to English.
The link shows us prompting ChatGPT, saying we have a dataframe in R and explaining a bit about the dataframe. The AI then gives us some code, and I then specify that we want specific columns to be checked. It then gives the code that is used at the end of Task 4, which has been changed only minimally.
Lastly, it shows that I asked it to translate a section of Task 3 to English, which was done to help with the grading of this task.